anchor data


GRAID: Synthetic Data Generation with Geometric Constraints and Multi-Agentic Reflection for Harmful Content Detection

Rad, Melissa Kazemi, Purpura, Alberto, Kumar, Himanshu, Chen, Emily, Sorower, Mohammad Shahed

arXiv.org Artificial Intelligence

We address the problem of data scarcity in harmful text classification for guardrailing applications and introduce GRAID (Geometric and Reflective AI-Driven Data Augmentation), a novel pipeline that leverages Large Language Models (LLMs) for dataset augmentation. GRAID consists of two stages: (i) generation of geometrically controlled examples using a constrained LLM, and (ii) augmentation through a multi-agentic reflective process that promotes stylistic diversity and uncovers edge cases. This combination enables both reliable coverage of the input space and nuanced exploration of harmful content. Using two benchmark data sets, we demonstrate that augmenting a harmful text classification dataset with GRAID leads to significant improvements in downstream guardrail model performance.
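The two-stage structure described above can be illustrated with a minimal runnable sketch. Everything below, including the `call_llm` stub, the prompt wording, and the loop counts, is hypothetical and stands in for the constrained generation and multi-agentic reflection the paper describes:

```python
# Hypothetical sketch of a two-stage GRAID-style augmentation loop.
# `call_llm` is a stand-in for any LLM client; here it is stubbed so the
# pipeline structure can run end to end.

def call_llm(prompt):
    # Stub: a real implementation would call an LLM API here.
    return f"example generated for: {prompt[:40]}"

def stage1_geometric_generation(seed_texts, n_per_seed=2):
    """Stage (i): generate examples constrained to stay near each seed
    (the 'geometric' control is expressed here only as a prompt constraint)."""
    generated = []
    for seed in seed_texts:
        for _ in range(n_per_seed):
            prompt = f"Write a paraphrase that stays close in meaning to: {seed}"
            generated.append(call_llm(prompt))
    return generated

def stage2_reflective_augmentation(examples, n_rounds=2):
    """Stage (ii): a critique/revise loop in the spirit of multi-agentic
    reflection, intended to add stylistic diversity and probe edge cases."""
    augmented = list(examples)
    for _ in range(n_rounds):
        for ex in examples:
            critique = call_llm(f"Critique this example for coverage gaps: {ex}")
            revised = call_llm(f"Rewrite to address the critique '{critique}': {ex}")
            augmented.append(revised)
    return augmented

seeds = ["a borderline harmful request", "an obviously benign request"]
dataset = stage2_reflective_augmentation(stage1_geometric_generation(seeds))
```

In a real pipeline, the stage-2 critique and revision would be produced by separate agent roles rather than a single stub.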


Sparse mixed linear modeling with anchor-based guidance for high-entropy alloy discovery

Murakami, Ryo, Miura, Seiji, Endo, Akihiro, Minamoto, Satoshi

arXiv.org Machine Learning

Materials Data Platform, Research Network and Facility Services Division, National Institute for Materials Science, Tsukuba 305-0044, Ibaraki, Japan; Division of Materials Science and Engineering, Faculty of Engineering, Hokkaido University, Sapporo 060-8628, Hokkaido, Japan. Compiled April 30, 2025.

ABSTRACT: High-entropy alloys have attracted attention for their exceptional mechanical properties and thermal stability. To support the exploration of their vast design space, machine learning techniques have been increasingly employed for property prediction and high-throughput screening. Nevertheless, highly accurate nonlinear models often suffer from a lack of interpretability, which is a major limitation. In this study, we focus on local data structures that emerge from the greedy search behavior inherent to experimental data acquisition. By introducing a linear and low-dimensional mixture regression model, we strike a balance between predictive performance and model interpretability. In addition, we develop an algorithm that simultaneously performs prediction and feature selection by considering multiple candidate descriptors. Through a case study on high-entropy alloys, this study introduces a method that combines anchor-guided clustering and sparse linear modeling to address biased data structures arising from greedy exploration in materials science.

KEYWORDS: Sparse modeling; Mixed linear model; Bayesian inference; Materials informatics; Data-driven science; High-entropy alloys

1. Introduction. In recent years, high-entropy alloys (HEAs) have garnered attention as next-generation materials for their outstanding mechanical properties, thermal stability, and corrosion resistance [1,2]. Unlike conventional alloy designs, HEAs, also referred to as multi-principal element alloys, comprise multiple (typically five or more) principal elements, offering a high degree of chemical and structural freedom. This unique composition enables the exploration of novel properties unattainable in traditional materials systems.
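The combination of anchor-guided clustering and per-cluster sparse linear fits can be sketched as follows. The anchors, synthetic data, and hard-thresholding rule are illustrative assumptions, not the paper's Bayesian algorithm:

```python
import numpy as np

# Hedged sketch (not the authors' method): assign samples to their nearest
# anchor, then fit an independent linear model per cluster, zeroing small
# coefficients for interpretability.

def anchor_guided_mixture_fit(X, y, anchors, sparsity_threshold=0.05):
    # Assign each sample to its nearest anchor (the local data structure).
    labels = np.argmin(
        ((X[:, None, :] - anchors[None, :, :]) ** 2).sum(axis=2), axis=1
    )
    models = {}
    for k in range(len(anchors)):
        mask = labels == k
        # Ordinary least squares within the cluster...
        coef, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
        # ...then zero out small coefficients to induce sparsity.
        coef[np.abs(coef) < sparsity_threshold] = 0.0
        models[k] = coef
    return labels, models

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
# Two local regimes: y depends on feature 0 on one side, feature 1 on the other.
y = np.where(X[:, 2] > 0, 2.0 * X[:, 0], -3.0 * X[:, 1])
anchors = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, -1.0]])
labels, models = anchor_guided_mixture_fit(X, y, anchors)
```

Each recovered coefficient vector is sparse and locally linear, which is the interpretability benefit the abstract emphasizes.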


OpenUAS: Embeddings of Cities in Japan with Anchor Data for Cross-city Analysis of Area Usage Patterns

Tamura, Naoki, Shoji, Kazuyuki, Katayama, Shin, Urano, Kenta, Yonezawa, Takuro, Kawaguchi, Nobuo

arXiv.org Artificial Intelligence

We publicly release OpenUAS, a dataset of area embeddings based on urban usage patterns, including embeddings for over 1.3 million 50-meter square meshes covering a total area of 3,300 square kilometers. This dataset is valuable for analyzing area functions in fields such as market analysis, urban planning, transportation infrastructure, and infection prediction. It captures the characteristics of each area in the city, such as office districts and residential areas, by employing an area embedding technique that utilizes location information typically obtained by GPS. Numerous area embedding techniques have been proposed, and while the public release of such embedding datasets is technically feasible, it has not been realized. One of the obstacles has been the integration of data from different cities and periods into a unified space without sharing raw location data. We address this issue by developing an anchoring method that establishes anchors within a shared embedding space. We publicly release this anchor dataset along with area embedding datasets from several periods in eight major Japanese cities. This dataset allows users to analyze urban usage patterns in Japanese cities and embed their urban dataset into the same embedding space using the anchoring method. Our key contributions include the development of the anchoring method, releasing area embedding datasets for Japanese cities, and providing tools for effective data utilization.
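One simple way to realize anchor-based alignment of two embedding spaces is orthogonal Procrustes over the shared anchors. This is a hedged sketch of the general idea, not the released OpenUAS tooling:

```python
import numpy as np

# Hedged sketch: two cities learn area embeddings independently; anchor points
# present in both spaces let us estimate an orthogonal map from city B's space
# into city A's space without exchanging raw location data.

def align_with_anchors(anchors_a, anchors_b):
    """Return an orthogonal R such that anchors_b @ R approximates anchors_a."""
    u, _, vt = np.linalg.svd(anchors_b.T @ anchors_a)
    return u @ vt

rng = np.random.default_rng(1)
true_map = np.linalg.qr(rng.normal(size=(4, 4)))[0]   # unknown ground truth
anchors_a = rng.normal(size=(10, 4))                  # anchors in A's space
anchors_b = anchors_a @ true_map.T                    # same anchors in B's space
R = align_with_anchors(anchors_a, anchors_b)

# Any city-B area embedding can now be mapped into city A's space:
mesh_a = rng.normal(size=(5, 4))
mesh_b = mesh_a @ true_map.T
mesh_in_a = mesh_b @ R
```

Only the anchor embeddings cross institutional boundaries; the raw GPS-derived data never does.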


New Solutions Based on the Generalized Eigenvalue Problem for the Data Collaboration Analysis

Kawakami, Yuta, Takano, Yuichi, Imakura, Akira

arXiv.org Artificial Intelligence

In recent years, the accumulation of data across various institutions has drawn attention to confidential data analysis, a technology that improves analytical accuracy by sharing data among multiple institutions while protecting sensitive information. Among these methods, Data Collaboration Analysis (DCA) is noted for its efficiency in terms of computational cost and communication load, facilitating data sharing and analysis across different institutions while safeguarding confidential information. However, existing optimization problems for determining the necessary collaborative functions have faced challenges, such as the optimal solution for the collaborative representation often degenerating to a zero matrix and the difficulty of interpreting how solutions are derived. This research addresses these issues by formulating the optimization problem through the segmentation of matrices into column vectors and proposing a solution method based on the generalized eigenvalue problem. Additionally, we demonstrate methods for constructing collaborative functions more effectively through weighting and the selection of efficient algorithms suited to specific situations. Experiments on real-world datasets show that our proposed formulation and solution for the collaborative function optimization problem achieve superior predictive accuracy compared to existing methods.
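The core numerical ingredient, a generalized eigenvalue problem A v = λ B v, reduces to a standard symmetric eigenproblem when B is positive definite. The matrices below are synthetic and the Cholesky reduction is a generic technique, not the paper's specific formulation:

```python
import numpy as np

# Hedged sketch of the core numerical step: solve A v = lambda * B v with
# A symmetric and B symmetric positive definite, via Cholesky reduction
# to a standard symmetric eigenproblem.

def top_generalized_eigvec(A, B):
    L = np.linalg.cholesky(B)
    Linv = np.linalg.inv(L)
    C = Linv @ A @ Linv.T            # symmetric standard-form matrix
    eigvals, eigvecs = np.linalg.eigh(C)
    v = Linv.T @ eigvecs[:, -1]      # map back: v solves A v = lambda * B v
    return eigvals[-1], v

rng = np.random.default_rng(2)
M = rng.normal(size=(5, 5))
A = M @ M.T                          # symmetric positive semi-definite
B = np.eye(5) * 2.0                  # symmetric positive definite
lam, v = top_generalized_eigvec(A, B)
```

Extracting the top eigenpair this way avoids the degenerate zero-matrix solution that plagues an unconstrained formulation, since eigenvectors are nonzero by construction.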


Enhancing Data Quality in Federated Fine-Tuning of Foundation Models

Zhao, Wanru, Du, Yaxin, Lane, Nicholas Donald, Chen, Siheng, Wang, Yanfeng

arXiv.org Artificial Intelligence

The PubMedQA task is designed to answer research questions with responses categorized as yes/no/maybe, effectively framing it as a multiple-choice question format. The dataset is divided into three subsets: 1,000 manually labeled question-answer pairs (denoted as PQA-L), 61,200 unlabeled pairs (PQA-U), and 211,300 pairs that have been artificially generated (PQA-A). Consistent with previous studies (Diao et al., 2023; Singhal et al., 2023), we employ the PQA-L subset as the test set for evaluating the model's performance. USMLE (Jin et al., 2021) consists of multiple-choice questions (with 4 choices per question) based on the United States Medical Licensing Exams. This dataset has been compiled from questions used in professional medical board examinations and is unique in its multilingual composition, including English, Simplified Chinese, and Traditional Chinese versions. It contains 12,724 questions in English, 34,251 in Simplified Chinese, and 14,123 in Traditional Chinese. For our purposes, we focus on the English component of the dataset, which is further divided into 10,178 questions for the training set, 1,273 for the validation set, and 1,273 for the test set, adhering to the official distribution of the dataset.


FedAnchor: Enhancing Federated Semi-Supervised Learning with Label Contrastive Loss for Unlabeled Clients

Qiu, Xinchi, Gao, Yan, Sani, Lorenzo, Pan, Heng, Zhao, Wanru, Gusmao, Pedro P. B., Alibeigi, Mina, Iacob, Alex, Lane, Nicholas D.

arXiv.org Artificial Intelligence

Federated learning (FL) is a distributed learning paradigm that facilitates collaborative training of a shared global model across devices while keeping data localized. The deployment of FL in numerous real-world applications faces delays, primarily due to the prevalent reliance on supervised tasks. Generating detailed labels at edge devices, if feasible, is demanding, given resource constraints and the imperative for continuous data updates. In addressing these challenges, solutions such as federated semi-supervised learning (FSSL), which relies on unlabeled clients' data and a limited amount of labeled data on the server, become pivotal. In this paper, we propose FedAnchor, an innovative FSSL method that introduces a unique double-head structure, called anchor head, paired with the classification head trained exclusively on labeled anchor data on the server. The anchor head is empowered with a newly designed label contrastive loss based on the cosine similarity metric. Our approach mitigates the confirmation bias and overfitting issues associated with pseudo-labeling techniques based on high-confidence model prediction samples. Extensive experiments on CIFAR10/100 and SVHN datasets demonstrate that our method outperforms the state-of-the-art method by a significant margin in terms of convergence rate and model accuracy.
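The label contrastive idea, pulling client features toward same-label anchor embeddings under a cosine similarity, can be sketched as follows. The loss form and toy data are assumptions, not the authors' exact objective:

```python
import numpy as np

# Hedged sketch of a cosine-similarity-based label contrastive loss in the
# spirit of the anchor head: features are pulled toward server-side anchor
# embeddings sharing their label and pushed away from other-label anchors.

def label_contrastive_loss(features, labels, anchors, anchor_labels, tau=0.1):
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    sim = (f @ a.T) / tau                      # scaled cosine similarities
    logsumexp = np.log(np.exp(sim).sum(axis=1))
    per_sample = [
        np.mean(logsumexp[i] - sim[i, anchor_labels == y])
        for i, y in enumerate(labels)
    ]
    return float(np.mean(per_sample))

# A feature aligned with its own class anchor yields a lower loss than one
# aligned with the wrong class anchor.
anchors = np.array([[1.0, 0.0], [0.0, 1.0]])
anchor_labels = np.array([0, 1])
aligned = label_contrastive_loss(np.array([[1.0, 0.1]]), np.array([0]),
                                 anchors, anchor_labels)
misaligned = label_contrastive_loss(np.array([[0.1, 1.0]]), np.array([0]),
                                    anchors, anchor_labels)
```

Because the targets are server-held anchors rather than pseudo-labels from confident model predictions, this style of loss sidesteps the confirmation bias the abstract mentions.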


ViT-Lens-2: Gateway to Omni-modal Intelligence

Lei, Weixian, Ge, Yixiao, Yi, Kun, Zhang, Jianfeng, Gao, Difei, Sun, Dylan, Ge, Yuying, Shan, Ying, Shou, Mike Zheng

arXiv.org Artificial Intelligence

Aiming to advance AI agents, large foundation models significantly improve reasoning and instruction execution, yet the current focus on vision and language neglects the potential of perceiving diverse modalities in open-world environments. However, the success of data-driven vision and language models is costly or even infeasible to reproduce for rare modalities. In this paper, we present ViT-Lens-2, which facilitates efficient omni-modal representation learning by perceiving novel modalities with a pretrained ViT and aligning them to a pre-defined space. Specifically, a modality-specific lens is tuned to project any-modal signals to an intermediate embedding space; these embeddings are then processed by a strong ViT with pre-trained visual knowledge. The encoded representations are optimized toward alignment with a modal-independent space pre-defined by off-the-shelf foundation models. ViT-Lens-2 provides a unified solution for representation learning over a growing set of modalities, with two appealing advantages: (i) unlocking the great potential of pretrained ViTs for novel modalities with an efficient data regime; (ii) enabling emergent downstream capabilities through modality alignment and shared ViT parameters. We tailor ViT-Lens-2 to learn representations for 3D point cloud, depth, audio, tactile, and EEG data, and set new state-of-the-art results across various understanding tasks, such as zero-shot classification. By seamlessly integrating ViT-Lens-2 into multimodal foundation models, we enable Any-modality to Text and Image Generation in a zero-shot manner. Code and models are available at https://github.com/TencentARC/ViT-Lens.


Data Collaboration Analysis applied to Compound Datasets and the Introduction of Projection data to Non-IID settings

Mizoguchi, Akihiro, Bogdanova, Anna, Imakura, Akira, Sakurai, Tetsuya

arXiv.org Artificial Intelligence

Given the time and expense associated with bringing a drug to market, numerous studies have used machine learning to predict the properties of compounds from their structure. Federated learning has been applied to compound datasets to increase prediction accuracy while safeguarding potentially proprietary information. However, federated learning suffers from low accuracy in settings where data are not identically and independently distributed (non-IID), i.e., where the data partitioning carries a large label bias, and it is therefore considered unsuitable for compound datasets, which tend to have such bias. To address this limitation, we applied an alternative distributed machine learning method, data collaboration analysis (DC), to chemical compound data from open sources. We also propose data collaboration analysis using projection data (DCPd), an improved method that utilizes auxiliary PubChem data to raise the quality of the individual user-side data transformations used to create the intermediate representations. The classification accuracy, i.e., the area under the receiver operating characteristic curve (ROC-AUC) and the area under the precision-recall curve (PR-AUC), of federated averaging (FedAvg), DC, and DCPd was compared on five compound datasets. We found that machine learning performance in non-IID settings ranked, from best to worst, DCPd, DC, and FedAvg, although the three were almost identical in identically and independently distributed (IID) settings. Moreover, compared to the other methods, DCPd exhibited a negligible decline in classification accuracy in experiments with different degrees of label bias. Thus, DCPd can address the low performance in non-IID settings, which is one of the challenges of federated learning.
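The underlying DC mechanism, aligning parties through intermediate representations of a shared anchor dataset, can be sketched as follows. The random projections and least-squares alignment are illustrative stand-ins, not the DCPd method itself:

```python
import numpy as np

# Hedged sketch of the data collaboration (DC) idea: each party reduces its
# private data with its own projection, applies the same projection to a
# shared anchor dataset, and a central server aligns all parties into a
# common space using only the anchor representations.

def private_projection(d_in, d_out, seed):
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.normal(size=(d_in, d_out)))
    return q  # each party keeps its projection secret

def collaborate(party_data, anchors, d_out=3):
    projections = [private_projection(anchors.shape[1], d_out, s)
                   for s in range(len(party_data))]
    anchor_reps = [anchors @ P for P in projections]  # these ARE shared
    target = anchor_reps[0]                           # common space: party 0's
    mapped, aligned_anchors = [], []
    for X, P, ar in zip(party_data, projections, anchor_reps):
        G, *_ = np.linalg.lstsq(ar, target, rcond=None)  # align via anchors
        mapped.append(X @ P @ G)
        aligned_anchors.append(ar @ G)
    return mapped, aligned_anchors

rng = np.random.default_rng(3)
anchors = rng.normal(size=(20, 5))                 # shareable anchor dataset
party_data = [rng.normal(size=(8, 5)), rng.normal(size=(8, 5))]
mapped, aligned = collaborate(party_data, anchors)
```

Only the anchor representations and the reduced data leave each party; DCPd's contribution, per the abstract, is improving the quality of these user-side transformations with auxiliary projection data.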


Achieving Transparency in Distributed Machine Learning with Explainable Data Collaboration

Bogdanova, Anna, Imakura, Akira, Sakurai, Tetsuya, Fujii, Tomoya, Sakamoto, Teppei, Abe, Hiroyuki

arXiv.org Artificial Intelligence

Transparency of Machine Learning models used for decision support in various industries becomes essential for ensuring their ethical use. To that end, feature attribution methods such as SHAP (SHapley Additive exPlanations) are widely used to explain the predictions of black-box machine learning models to customers and developers. However, a parallel trend has been to train machine learning models in collaboration with other data holders without accessing their data. Such models, trained over horizontally or vertically partitioned data, present a challenge for explainable AI because the explaining party may have a biased view of background data or a partial view of the feature space. As a result, explanations obtained from different participants of distributed machine learning might not be consistent with one another, undermining trust in the product. This paper presents an Explainable Data Collaboration Framework based on a model-agnostic additive feature attribution algorithm (KernelSHAP) and Data Collaboration method of privacy-preserving distributed machine learning. In particular, we present three algorithms for different scenarios of explainability in Data Collaboration and verify their consistency with experiments on open-access datasets. Our results demonstrated a significant (by at least a factor of 1.75) decrease in feature attribution discrepancies among the users of distributed machine learning.
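Why explanations diverge across parties is easy to see in the linear case, where Shapley attributions have a closed form that depends on the background data. This toy example (not KernelSHAP itself) illustrates the discrepancy the framework aims to reduce:

```python
import numpy as np

# Hedged toy illustration: for a linear model f(x) = w . x, the Shapley value
# of feature i is w_i * (x_i - background_mean_i). Two parties explaining the
# same prediction against different background data obtain different
# attributions.

def linear_shap(w, x, background):
    return w * (x - background.mean(axis=0))

w = np.array([1.0, -2.0, 0.5])
x = np.array([2.0, 1.0, 4.0])
party_a_background = np.zeros((10, 3))   # hypothetical party A's data view
party_b_background = np.ones((10, 3))    # hypothetical party B's data view

phi_a = linear_shap(w, x, party_a_background)
phi_b = linear_shap(w, x, party_b_background)
discrepancy = np.abs(phi_a - phi_b)
```

Both attributions satisfy the efficiency property (they sum to the prediction minus the background expectation), yet they disagree feature by feature, which is exactly the inconsistency measured in the paper's experiments.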


Another Use of SMOTE for Interpretable Data Collaboration Analysis

Imakura, Akira, Kihira, Masateru, Okada, Yukihiko, Sakurai, Tetsuya

arXiv.org Artificial Intelligence

Recently, data collaboration (DC) analysis has been developed for privacy-preserving integrated analysis across multiple institutions. DC analysis centralizes individually constructed, dimensionality-reduced intermediate representations and realizes integrated analysis via collaboration representations without sharing the original data. To construct the collaboration representations, each institution generates a shareable anchor dataset and centralizes its intermediate representation. Although a random anchor dataset generally works well for DC analysis, an anchor dataset whose distribution is close to that of the raw dataset is expected to improve recognition performance, particularly for interpretable DC analysis. Based on an extension of the synthetic minority over-sampling technique (SMOTE), this study proposes an anchor data construction technique that improves recognition performance without increasing the risk of data leakage. Numerical results demonstrate the efficiency of the proposed SMOTE-based method over existing anchor data constructions on artificial and real-world datasets. Specifically, the proposed method achieves 9-percentage-point and 38-percentage-point improvements in accuracy and essential feature selection, respectively, over existing methods on an income dataset. The proposed method thus provides another use of SMOTE, not for imbalanced data classification but as a key technology for privacy-preserving integrated analysis.
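The SMOTE-style interpolation at the heart of such anchor construction can be sketched as follows. The neighbor count and sampling scheme are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np

# Hedged sketch of SMOTE-style anchor construction (the general interpolation
# idea): synthetic anchor points are drawn on line segments between each raw
# sample and one of its nearest neighbors, so the anchor distribution tracks
# the raw data distribution without reusing any raw record verbatim.

def smote_anchors(X, n_anchors, k=3, seed=0):
    rng = np.random.default_rng(seed)
    # Pairwise distances for k-nearest-neighbor lookup.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    nn = np.argsort(d, axis=1)[:, :k]
    anchors = np.empty((n_anchors, X.shape[1]))
    for i in range(n_anchors):
        j = rng.integers(len(X))                 # pick a raw sample
        m = nn[j, rng.integers(k)]               # pick one of its k neighbors
        t = rng.random()                         # interpolation weight in (0, 1)
        anchors[i] = X[j] + t * (X[m] - X[j])
    return anchors

rng = np.random.default_rng(4)
X = rng.normal(size=(30, 2))
anchors = smote_anchors(X, n_anchors=100)
```

Because every anchor lies strictly between two raw points, the anchor set mirrors the raw distribution while never exposing an original record, which is the leakage argument sketched in the abstract.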